Exploring edX data during 2012 to 2013 by Kan-Hua Lee

Univariate Analysis

What is the structure of your dataset?

This dataset has 27 columns and 641,138 rows. Each row holds the summary statistics for one user in one course. The dataset includes 16 courses. Three of those are the same course material offered at two different times. Since a user can register for more than one course, the number of unique registrants across all courses is smaller than the number of rows: 476,549.

## [1] 641138     27
## [1] 476549      9

What is/are the main feature(s) of interest in your dataset?

We are mainly interested in the backgrounds and activities of the students who attended each course and earned its certificate. In particular, we would like to find out how this information relates to the performance of the students. The features in this dataset that we will focus on are: registered, explored, LoE_DI, YoB, gender, and nevents. The responses that we are interested in are certified and grade.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In this analysis we use nevents as the main measure of how much effort a student puts into a course. Other features such as ndays_act, start_time_DI, last_event_DI and nforum_posts will also be used to support this investigation.

Did you create any new variables from existing variables in the dataset?

The following new variables are created in the original dataframe edxdata:

  • age: the age of the user when taking the course. It is calculated as 2013 - YoB.
  • access.period: number of days between start_time_DI and last_event_DI.
  • access.rate: ndays_act divided by access.period. This variable measures how often a user accesses the course.
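The three definitions above can be sketched in a few lines. This is only an illustration in Python, not the code used for the report; the function name `add_derived_features`, the dictionary representation of a row, and the handling of a zero-day access period are assumptions.

```python
from datetime import date


def add_derived_features(row, reference_year=2013):
    """Return a copy of `row` with age, access.period and access.rate added.

    `row` is a hypothetical dict holding one user-course record, with dates
    already parsed into datetime.date objects.
    """
    out = dict(row)

    yob = row.get("YoB")
    out["age"] = reference_year - yob if yob is not None else None

    start = row.get("start_time_DI")
    last = row.get("last_event_DI")
    period = (last - start).days if start is not None and last is not None else None
    out["access.period"] = period

    ndays = row.get("ndays_act")
    # Assumption: a missing or zero-day period gives an undefined rate (None)
    # rather than a division by zero.
    out["access.rate"] = ndays / period if period and ndays is not None else None
    return out


# Made-up record, for illustration only.
example = add_derived_features({
    "YoB": 1988,
    "start_time_DI": date(2012, 9, 1),
    "last_event_DI": date(2012, 9, 21),
    "ndays_act": 5,
})
```

With this made-up record the access period is 20 days, so five active days give an access.rate of 0.25.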

We also grouped the raw data by the following features and created new variables in each of the resulting data sets:

users

This data frame groups and summarises the data by user. The following new variables are created:

  • course_taken: number of courses viewed.
  • total_registered: number of courses registered.
  • total_explored: number of courses explored.
  • user.certificates: number of courses certified.
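A minimal sketch of this per-user summary, written in Python for illustration. The user identifier column name `userid_DI` and the function name `summarise_users` are assumptions, and the flags are assumed to be stored as 0/1 or booleans.

```python
from collections import defaultdict


def summarise_users(rows, user_key="userid_DI"):
    """Count viewed, registered, explored and certified courses per user.

    `rows` is an iterable of dicts, one per user-course record.
    `user_key` is an assumed name for the user identifier column.
    """
    totals = defaultdict(lambda: {
        "course_taken": 0,
        "total_registered": 0,
        "total_explored": 0,
        "user.certificates": 0,
    })
    for row in rows:
        t = totals[row[user_key]]
        t["course_taken"] += int(bool(row["viewed"]))
        t["total_registered"] += int(bool(row["registered"]))
        t["total_explored"] += int(bool(row["explored"]))
        t["user.certificates"] += int(bool(row["certified"]))
    return dict(totals)


# Made-up records, for illustration only.
users = summarise_users([
    {"userid_DI": "u1", "registered": 1, "viewed": 1, "explored": 1, "certified": 1},
    {"userid_DI": "u1", "registered": 1, "viewed": 1, "explored": 0, "certified": 0},
    {"userid_DI": "u2", "registered": 1, "viewed": 0, "explored": 0, "certified": 0},
])
```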

courses

We grouped and summarised the data by course_id. The following new variables are created:

  • passed_num: total certified users of the course.
  • explored_num: total users who explored the course.
  • registered_num: total users who registered for the course.
  • total_nforum_posts: total number of posts in the course.
  • pass.rate: the number of certified users divided by the number of registered users.
  • hangon.rate: the number of explored users divided by the number of registered users.
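The per-course summary and the two rates can be sketched as follows. This is an illustrative Python version rather than the original code; the function name `summarise_courses` is hypothetical, and returning None for a course with no registrants is an assumption.

```python
from collections import defaultdict


def summarise_courses(rows):
    """Aggregate user-course records by course_id and derive the two rates."""
    courses = defaultdict(lambda: {
        "passed_num": 0,
        "explored_num": 0,
        "registered_num": 0,
        "total_nforum_posts": 0,
    })
    for row in rows:
        c = courses[row["course_id"]]
        c["passed_num"] += int(bool(row["certified"]))
        c["explored_num"] += int(bool(row["explored"]))
        c["registered_num"] += int(bool(row["registered"]))
        c["total_nforum_posts"] += row.get("nforum_posts") or 0

    for c in courses.values():
        registered = c["registered_num"]
        # Assumption: the rates are undefined (None) when nobody registered.
        c["pass.rate"] = c["passed_num"] / registered if registered else None
        c["hangon.rate"] = c["explored_num"] / registered if registered else None
    return dict(courses)


# Made-up records for a single hypothetical course "A".
courses = summarise_courses([
    {"course_id": "A", "registered": 1, "explored": 1, "certified": 1, "nforum_posts": 2},
    {"course_id": "A", "registered": 1, "explored": 1, "certified": 0, "nforum_posts": 1},
    {"course_id": "A", "registered": 1, "explored": 0, "certified": 0, "nforum_posts": 0},
    {"course_id": "A", "registered": 1, "explored": 0, "certified": 0, "nforum_posts": 0},
])
```

In this made-up example, one certificate and two explorers among four registrants give a pass.rate of 0.25 and a hangon.rate of 0.5.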

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A few preprocessing steps were applied to the raw data:

  • Transform the feature LoE_DI into levels.
  • Transform the features certified, explored and viewed into logical data type.
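As a rough illustration of these two steps, the Python sketch below maps LoE_DI labels to positions in an ordered list of levels and converts 0/1 flags to logical values. The ordering in `LOE_LEVELS`, the exact label spellings (with ASCII apostrophes), and the treatment of blank or "NA" entries as missing are assumptions, not details taken from the original code.

```python
# Assumed ordering of education levels, lowest to highest.
LOE_LEVELS = ["Less than Secondary", "Secondary", "Bachelor's", "Master's", "Doctorate"]


def loe_to_level(value):
    """Return the index of an LoE_DI label in LOE_LEVELS, or None if missing or unknown."""
    if value is None or value in ("", "NA"):
        return None
    try:
        return LOE_LEVELS.index(value)
    except ValueError:
        return None


def to_logical(value):
    """Convert a 0/1 flag (int, float or string) to bool; missing values become None."""
    if value is None or value in ("", "NA"):
        return None
    return int(float(value)) == 1


# Made-up record, for illustration only.
raw = {"LoE_DI": "Master's", "certified": "1", "explored": 0, "viewed": "1.0"}
clean = {
    "LoE_DI": loe_to_level(raw["LoE_DI"]),
    "certified": to_logical(raw["certified"]),
    "explored": to_logical(raw["explored"]),
    "viewed": to_logical(raw["viewed"]),
}
```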

Univariate Plots Section

First, we investigate some basic user statistics for each course:

Number of registrants of each course:

Number of registrants who “explored” the course:

Number of registrants who earned certificates:

pass.rate of each course:

We then investigate the statistics of users:

Number of certificates earned by the registrants:

Please note that the y-axis of the above graph is in log scale.

Histogram of age among all registrants:

Although the distribution of age is very wide, most registrants are between 20 and 35 years old.

Histogram of LoE_DI of all registrants, with NA and blank (“”) filtered out:

LoE_DI is dominated by Less than Secondary, Master’s and Doctorate.

Histogram of gender of all registrants, with NA and blank (“”) filtered:

Next, we investigate the activities of the registrants using the feature nevents.

Histogram of nevents of all registrants:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Since the histogram is dominated by small values of nevents, we only plot the data with nevents<10000 in order to show the trend more clearly.
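The filtering and binning described here can be sketched in Python as below. The function name `histogram`, the bin width of 500 and the sample values are all made up for illustration; the point is only that the cutoff and an explicit bin width are chosen by hand rather than left to a default.

```python
from collections import Counter


def histogram(values, binwidth, upper=None):
    """Count values per bin of width `binwidth`, keeping only values below `upper`.

    Bins are labelled by their lower edge. Missing values (None) are dropped.
    """
    kept = [v for v in values if v is not None and (upper is None or v < upper)]
    counts = Counter((v // binwidth) * binwidth for v in kept)
    return dict(sorted(counts.items()))


# Made-up nevents values; the last two fall outside the nevents < 10000 cutoff.
nevents = [3, 12, 40, 250, 260, 4800, 9999, 10000, 52000]
bins = histogram(nevents, binwidth=500, upper=10000)
```

Even in this tiny example, most of the mass lands in the first bin, which is why truncating the long tail makes the rest of the distribution easier to read.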

Histogram of nevents of registrants that passed a course (certified==1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Within the group of certified==1, nevents is more evenly distributed between 0 and 10000.

Bivariate Plots Section

Investigating the background of the registrants

Gender distribution of all registrants, by course

The figure shows that most of the courses are dominated by male registrants.

Gender distribution of certified registrants, by course

The certified registrants of most courses are also predominantly male, except in Poverty and HealthStat.

The distribution of LoE_DI, by course:

The above figure shows that the population of registrants with Bachelor’s and Secondary degrees is relatively small in all courses. Note that the x-axis is on a log scale.

Investigating the activities of registrants

Normalized distribution of nevents of all registrants, by course
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The distributions of most courses are very similar. All the curves drop very sharply below 100 nevents; beyond that, the decrease is milder.

Distribution of nevents of all registrants with certificates, by course
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The distribution of nevents becomes more Gaussian if we only consider the population of those with certificates. However, the peaks of these distributions vary from course to course.

Boxplot of nevents of registrants with certificates, by course

This plot is a different representation of the previous plot. The boxplot shows the median and variation of nevents more clearly. Although the median nevents varies a lot between courses, most courses have a median of a few thousand. The course EM has the highest median nevents, while the course CSH has the lowest.

In the next three graphs, we plot similar boxplots of ndays_act, access.period and access.rate instead of nevents.

Boxplot of ndays_act of registrants with certificates, by course

Boxplot of access.period of registrants with certificates, by course

Boxplot of access.rate of registrants with certificates, by course

We can see that ndays_act, access.period and access.rate all vary a lot between courses, as we observed in the boxplot of nevents.

We pick the user activity features nevents and ndays_act and explore their relation:

ndays_act against nevents among the “explored” registrants (explored==1):

From this plot, we can see that ndays_act and nevents are strongly correlated. Also, entries with more nevents and ndays_act have a higher chance of being certified.
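The report judges this correlation from the plot. One way to put a number on it would be the Pearson correlation coefficient, sketched below in Python. This is an illustration rather than a step from the original analysis; the function name `pearson` and the sample values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up values in which nevents grows linearly with ndays_act.
ndays_act = [1, 2, 3, 4]
nevents = [100, 200, 300, 400]
r = pearson(ndays_act, nevents)
```

Because both features are heavily skewed counts, a rank-based measure or a log transform before computing the coefficient may be worth considering as well.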

In the last part of this section, we explore some other features.

total_nforum_posts against pass.rate of each course

Since total_nforum_posts might be an indicator of the support offered by the community, one may assume that a large total_nforum_posts helps the pass.rate. However, we did not see such a relation in the above plot.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  1. In every course except Poverty (The Challenges of Global Poverty), registrants with certificates are predominantly male.

  2. Registrants with certificates are largely dominated by those who do not have a Secondary school degree (Less than Secondary) or who hold an advanced degree (Master’s and Doctorate). This trend exists in every course in this dataset.

  3. nevents is roughly exponentially distributed among all registrants, but the distribution becomes more Gaussian when only certified registrants are included.

  4. The variation of nevents among certified users of each course is large. The typical value of nevents for each course is a few thousand.

  5. The course EM (Electricity and Magnetism) has the highest mean values of nevents, access.rate and ndays_act across all courses, which suggests that it may be the most demanding of these 16 courses.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. total_nforum_posts is not strongly correlated with pass.rate, suggesting that support from the community in the forum is not a key factor for pass.rate.

  2. Most users have about 100 nevents per active day (ndays_act), as shown in the figure of ndays_act against nevents. The figure also shows that users with more nevents and ndays_act have a higher chance of passing the course.

What was the strongest relationship you found?

Within these 16 courses, registrants who passed are predominantly male. Also, the certified students are dominated by those who do not have a Secondary school degree (Less than Secondary) or who hold an advanced degree (Master’s and Doctorate).

Multivariate Plots Section

Distribution of grade of each course
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

nevents against grade, by course

access.rate against grade, by course

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From these figures, we found that the correlations between grade and nevents and between grade and access.rate are not very strong. However, courses such as EM and CSM13 do show weak trends in which a higher access.rate or more nevents goes with a higher grade.

Were there any interesting or surprising interactions between features?

The distribution of grade behaves differently from course to course. Two different patterns are commonly seen:

  1. M-shape: Two main peaks exist in the distribution. One is in the region below the pass grade, while the other is in the region above the pass grade. Biology, CSM12, CSM13 and Poverty have M-shaped distributions.

  2. U-shape: Most of the population accumulates at both ends of the range. The distributions of Circuits12, Circuits13 and JusticeX have this characteristic.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

This figure is the density plot of education level (LoE_DI) of the registrants in each course, with and without certificates. TRUE denotes the registrants with certificates, whereas FALSE denotes the registrants without certificates. Data entries with LoE_DI equal to NA or "" are ignored. As shown in this figure, the registrants of each course are dominated by participants who hold an advanced degree or have less than a Secondary degree. The population of users who hold a Secondary or Bachelor’s degree is small in every course. In addition, this plot shows that the composition of education levels does not vary much between the populations with and without certificates.

Plot Two

Description Two

This figure shows a boxplot and a scatter plot of nevents for each certified participant, by course. The lower, middle and upper hinges of the boxplot represent the 25th, 50th and 75th percentiles of nevents, respectively. nevents records the number of interactions with the course and can therefore be an indication of the effort required to pass a course. The boxplot helps compare the median and spread of nevents between courses. For example, the median and variation of nevents for the course CSH are very small compared to other courses. On the other hand, the course EM has the highest median nevents among the courses, suggesting that this course needs more effort to complete.
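The quartiles behind such a boxplot can be computed per course as in the Python sketch below. It is only an illustration: the function name `nevents_quartiles_by_course` and the sample values are made up, and `statistics.quantiles` with the inclusive method may give slightly different hinge values from the plotting library used for the figure.

```python
import statistics
from collections import defaultdict


def nevents_quartiles_by_course(rows):
    """Return (25th, 50th, 75th) percentiles of nevents per course, certified users only."""
    by_course = defaultdict(list)
    for row in rows:
        if row["certified"] and row["nevents"] is not None:
            by_course[row["course_id"]].append(row["nevents"])
    # statistics.quantiles needs at least two data points per group.
    return {
        course: tuple(statistics.quantiles(values, n=4, method="inclusive"))
        for course, values in by_course.items()
        if len(values) >= 2
    }


# Made-up records; only the course labels EM and CSH come from the report.
rows = (
    [{"course_id": "EM", "certified": True, "nevents": n} for n in (1000, 2000, 3000, 4000, 5000)]
    + [{"course_id": "CSH", "certified": True, "nevents": n} for n in (100, 200, 300)]
    + [{"course_id": "CSH", "certified": False, "nevents": 9999}]
)
quartiles = nevents_quartiles_by_course(rows)
```

The middle value of each tuple is the median, and the distance between the first and last values is the interquartile range that sets the height of each box.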

Plot Three

Description Three

The figure shows grade against nevents for each course among “explored” registrants (explored==1). Colors separate the registrants who earned a certificate from those who did not. This set of figures shows that the correlation between grade and nevents is weak. However, courses such as CSM13 and EM show a slightly stronger correlation between grade and nevents.


Reflection

In this study we analysed which features affect the performance of registrants in a MOOC, and how. In most of the analysis, we chose the “explored” registrants (explored==1) as the sample. Including all the registrants of each course could bias the results heavily, because most registrants are not very involved in the courses. It is worth noting that a few percent of the certified registrants are not “explored” registrants; these exceptions are neglected in the analyses. In addition, we chose certified and grade to gauge the performance of the users, but different grading policies and certificate requirements make it difficult to find universal trends across courses.

Through these data visualizations, we found some notable trends in gender and level of education among the registrants who earned certificates. However, we struggled to find strong correlations between performance and features such as nevents, ndays_act or access.rate, because participants have different backgrounds and different goals in exploring a course, which is also part of the appeal of MOOCs.

This dataset could be explored further with statistical learning methods such as regression or decision trees to predict certified or grade. The correlations between users’ countries (final_cc_cname_DI) and other features would also be interesting to look into.